{
 "metadata": {
  "name": "",
  "signature": "sha256:e82ca2a940370f81e029d0cff7e2f76c6f54a27cc490cd97b70e303c3f16b937"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Clustering with KMeans in Shogun Machine Learning Toolbox"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Notebook by Parijat Mazumdar (GitHub ID: <a href='https://github.com/mazumdarparijat'>mazumdarparijat</a>)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This notebook demonstrates <a href=\"http://en.wikipedia.org/wiki/K-means_clustering\">clustering with KMeans</a> in Shogun along with its initialization and training. The initialization of cluster centres is shown manually, randomly and using the <a href=\"http://en.wikipedia.org/wiki/K-means%2B%2B\">KMeans++</a> algorithm. Training is done via the classical <a href=\"http://en.wikipedia.org/wiki/Lloyd%27s_algorithm\">Lloyds</a> and mini-batch KMeans method.\n",
      "It is then applied to a real world data set. Furthermore, the effect of dimensionality reduction using <a href=\"http://en.wikipedia.org/wiki/Principal_component_analysis\">PCA</a> is analysed on the KMeans algorithm."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "KMeans - An Overview"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The <a href=\"http://en.wikipedia.org/wiki/K-means_clustering\">KMeans clustering algorithm</a> is used to partition a space of n observations into k partitions (or clusters). Each of these clusters is denoted by the mean of the observation vectors belonging to it and a unique label which is attached to all the observations belonging to it. Thus, in general, the algorithm takes parameter k and an observation matrix (along with the notion of distance between points ie <i>distance metric</i>) as input and returns mean of each of the k clusters along with labels indicating belongingness of each observations. Let us construct a simple example to understand how it is done in Shogun using the <a href=\"http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKMeans.html\">CKMeans</a> class. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us start by creating a toy dataset."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from numpy import concatenate, array\n",
      "from numpy.random import randn\n",
      "import os\nSHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')\n",
      "\n",
      "num = 200\n",
      "d1 = concatenate((randn(1,num),10.*randn(1,num)),0)\n",
      "d2 = concatenate((randn(1,num),10.*randn(1,num)),0)+array([[10.],[0.]])\n",
      "d3 = concatenate((randn(1,num),10.*randn(1,num)),0)+array([[0.],[100.]])\n",
      "d4 = concatenate((randn(1,num),10.*randn(1,num)),0)+array([[10.],[100.]])\n",
      "\n",
      "rectangle = concatenate((d1,d2,d3,d4),1)\n",
      "totalPoints = 800 "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The toy data created above consists of 4 gaussian blobs, having 200 points each, centered around the vertices of a rectancle. Let's plot it for convenience."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import matplotlib.pyplot as pyplot\n",
      "%matplotlib inline\n",
      "\n",
      "figure,axis = pyplot.subplots(1,1)\n",
      "axis.plot(rectangle[0], rectangle[1], 'o', color='r', markersize=5)\n",
      "axis.set_xlim(-5,15)\n",
      "axis.set_ylim(-50,150)\n",
      "axis.set_title('Toy data : Rectangle')\n",
      "pyplot.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "With data at our disposal, it is time to apply KMeans to it using the KMeans class in Shogun. First we construct Shogun features from our data:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from shogun import *\n",
      "\n",
      "train_features = RealFeatures(rectangle)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next we specify the number of clusters we want and create a distance object specifying the distance metric to be used over our data for our KMeans training:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# number of clusters\n",
      "k = 2\n",
      "\n",
      "# distance metric over feature matrix - Euclidean distance\n",
      "distance = EuclideanDistance(train_features, train_features)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next, we create a KMeans object with our desired inputs/parameters and train:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# KMeans object created\n",
      "kmeans = KMeans(k, distance)\n",
      "\n",
      "# KMeans training \n",
      "kmeans.train()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now that training has been done, let's get the cluster centers and label for each data point "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# cluster centers\n",
      "centers = kmeans.get_cluster_centers()\n",
      "\n",
      "# Labels for data points\n",
      "result = kmeans.apply()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally let us plot the centers and the data points (in different colours for different clusters):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def plotResult(title = 'KMeans Plot'):\n",
      "    figure,axis = pyplot.subplots(1,1)\n",
      "    for i in range(totalPoints):\n",
      "        if result[i]==0.0:\n",
      "            axis.plot(rectangle[0,i], rectangle[1,i], 'o', color='g', markersize=3)\n",
      "        else:\n",
      "            axis.plot(rectangle[0,i], rectangle[1,i], 'o', color='y', markersize=3)\n",
      "    axis.plot(centers[0,0], centers[1,0], 'ko', color='g', markersize=10)\n",
      "    axis.plot(centers[0,1], centers[1,1], 'ko', color='y', markersize=10)\n",
      "    axis.set_xlim(-5,15)\n",
      "    axis.set_ylim(-50,150)\n",
      "    axis.set_title(title)\n",
      "    pyplot.show()\n",
      "    \n",
      "plotResult('KMeans Results')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<b>Note:</b> You might not get the perfect result always. That is an inherent flaw of KMeans algorithm. In subsequent sections, we will discuss techniques which allow us to counter this.<br>\n",
      "Now that we have already worked out a simple KMeans implementation, it's time to understand certain specifics of KMeans implementaion and the options provided by Shogun to its users."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Initialization of cluster centers "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The KMeans algorithm requires that the cluster centers are initialized with some values. Shogun offers 3 ways to initialize the clusters. <ul><li>Random initialization (default)</li><li>Initialization by hand</li><li>Initialization using <a href=\"http://en.wikipedia.org/wiki/K-means%2B%2B\">KMeans++ algorithm</a></li></ul>Unless the user supplies initial centers or tells Shogun to use KMeans++, Random initialization is the default method used for cluster center initialization. This was precisely the case in the example discussed above."
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "    Initialization by hand"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "There are 2 ways to initialize centers by hand. One way is to pass on the centers during KMeans object creation, as follows:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from numpy import array\n",
      "initial_centers = array([[0.,10.],[50.,50.]])\n",
      "\n",
      "# initial centers passed\n",
      "kmeans = KMeans(k, distance, initial_centers)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, let's first get results by repeating the rest of the steps:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# KMeans training \n",
      "kmeans.train(train_features)\n",
      "\n",
      "# cluster centers\n",
      "centers = kmeans.get_cluster_centers()\n",
      "\n",
      "# Labels for data points\n",
      "result = kmeans.apply()\n",
      "\n",
      "# plot the results\n",
      "plotResult('Hand initialized KMeans Results 1')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The other way to initialize centers by hand is as follows: "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "new_initial_centers = array([[5.,5.],[0.,100.]])\n",
      "\n",
      "# set new initial centers\n",
      "kmeans.set_initial_centers(new_initial_centers)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's complete the rest of the code to get results."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# KMeans training \n",
      "kmeans.train(train_features)\n",
      "\n",
      "# cluster centers\n",
      "centers = kmeans.get_cluster_centers()\n",
      "\n",
      "# Labels for data points\n",
      "result = kmeans.apply()\n",
      "\n",
      "# plot the results\n",
      "plotResult('Hand initialized KMeans Results 2')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note the difference that inititial cluster centers can have on final result. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Initializing using KMeans++ algorithm"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In Shogun, a user can also use <a href=\"http://en.wikipedia.org/wiki/K-means%2B%2B\">KMeans++ algorithm</a> for center initialization. Using KMeans++ for center initialization is beneficial because it reduces total iterations used by KMeans and also the final centers mostly correspond to the global minima, which is often not the case with KMeans with random initialization. One of the ways to use KMeans++ is to set flag as <i>true</i> during KMeans object creation, as follows:  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# set flag for using KMeans++\n",
      "kmeans = KMeans(k, distance, True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The other way to initilize using KMeans++ is as follows:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# set KMeans++ flag\n",
      "kmeans.set_use_kmeanspp(True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Completing rest of the steps to get result:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# KMeans training \n",
      "kmeans.train(train_features)\n",
      "\n",
      "# cluster centers\n",
      "centers = kmeans.get_cluster_centers()\n",
      "\n",
      "# Labels for data points\n",
      "result = kmeans.apply()\n",
      "\n",
      "# plot the results\n",
      "plotResult('KMeans with KMeans++ Results')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To switch back to random initialization, you may use:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#unset KMeans++ flag\n",
      "kmeans.set_use_kmeanspp(False)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Training Methods"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Shogun offers 2 training methods for KMeans clustering:<ul><li><a href='http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm'>Classical Lloyd's training</a> (default)</li><li><a href='http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf'>mini-batch KMeans training</a></li></ul>Lloyd's training method is used by Shogun by default unless user switches to mini-batch training method."
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Mini-Batch KMeans"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Mini-batch KMeans is very useful in case of extremely large datasets and/or very high dimensional data which is often the case in text mining. One can switch to Mini-batch KMeans training while creating KMeans object as follows:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# set training method to mini-batch\n",
      "kmeans = KMeansMiniBatch(k, distance)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In mini-batch KMeans it is compulsory to set batch-size and number of iterations. These parameters can be set together or one after the other."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# set both parameters together batch size-2 and no. of iterations-100\n",
      "kmeans.set_mb_params(2,100)\n",
      "\n",
      "# OR\n",
      "\n",
      "# set batch size-2\n",
      "kmeans.set_batch_size(2)\n",
      "\n",
      "# set no. of iterations-100\n",
      "kmeans.set_mb_iter(100)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Completing the code to get results:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# KMeans training \n",
      "kmeans.train(train_features)\n",
      "\n",
      "# cluster centers\n",
      "centers = kmeans.get_cluster_centers()\n",
      "\n",
      "# Labels for data points\n",
      "result = kmeans.apply()\n",
      "\n",
      "# plot the results\n",
      "plotResult('Mini-batch KMeans Results')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Applying KMeans on Real Data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this section we see how useful KMeans can be in classifying the different varieties of Iris plant. For this purpose, we make use of Fisher's Iris dataset borrowed from the <a href='http://archive.ics.uci.edu/ml/datasets/Iris'>UCI Machine Learning Repository</a>. There are 3 varieties of Iris plants\n",
      "<ul><li>Iris Sensosa</li><li>Iris Versicolour</li><li>Iris Virginica</li></ul>\n",
      "The Iris dataset enlists 4 features that can be used to segregate these varieties, namely\n",
      "<ul><li>sepal length</li><li>sepal width</li><li>petal length</li><li>petal width</li></ul>\n",
      "It is additionally acknowledged that petal length and petal width are the 2 most important features (ie. features with very high class correlations)[refer to <a href='http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names'>summary statistics</a>]. Since the entire feature vector is impossible to plot, we only plot these two most important features in order to understand the dataset (at least partially). Note that we could have extracted the 2 most important features by applying PCA (or any one of the many dimensionality reduction methods available in Shogun) as well."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "f = open(os.path.join(SHOGUN_DATA_DIR, 'uci/iris/iris.data'))\n",
      "features = []\n",
      "# read data from file\n",
      "for line in f:\n",
      "\twords = line.rstrip().split(',')\n",
      "\tfeatures.append([float(i) for i in words[0:4]])\n",
      "\n",
      "f.close()\n",
      "\n",
      "# create observation matrix\n",
      "obsmatrix = array(features).T\n",
      "\n",
      "# plot the data\n",
      "figure,axis = pyplot.subplots(1,1)\n",
      "# First 50 data belong to Iris Sentosa, plotted in green\n",
      "axis.plot(obsmatrix[2,0:50], obsmatrix[3,0:50], 'o', color='green', markersize=5)\n",
      "# Next 50 data belong to Iris Versicolour, plotted in red\n",
      "axis.plot(obsmatrix[2,50:100], obsmatrix[3,50:100], 'o', color='red', markersize=5)\n",
      "# Last 50 data belong to Iris Virginica, plotted in blue\n",
      "axis.plot(obsmatrix[2,100:150], obsmatrix[3,100:150], 'o', color='blue', markersize=5)\n",
      "axis.set_xlim(-1,8)\n",
      "axis.set_ylim(-1,3)\n",
      "axis.set_title('3 varieties of Iris plants')\n",
      "pyplot.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the above plot we see that the data points labelled Iris Sentosa form a nice separate cluster of their own. But in case of other 2 varieties, while the data points of same label do form clusters of their own, there is some mixing between the clusters at the boundary. Now let us apply KMeans algorithm and see how well we can extract these clusters. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def apply_kmeans_iris(data):\n",
      "    # wrap to Shogun features\n",
      "    train_features = RealFeatures(data)\n",
      "\n",
      "    # number of cluster centers = 3\n",
      "    k = 3\n",
      "\n",
      "    # distance function features - euclidean\n",
      "    distance = EuclideanDistance(train_features, train_features)\n",
      "\n",
      "    # initialize KMeans object\n",
      "    kmeans = KMeans(k, distance)\n",
      "\n",
      "    # use kmeans++ to initialize centers [play around: change it to False and compare results]\n",
      "    kmeans.set_use_kmeanspp(True)\n",
      "\n",
      "    # training method is Lloyd by default [play around: change it to mini-batch by uncommenting the following lines]\n",
      "    #kmeans.set_train_method(KMM_MINI_BATCH)\n",
      "    #kmeans.set_mbKMeans_params(20,30)\n",
      "\n",
      "    # training kmeans\n",
      "    kmeans.train(train_features)\n",
      "\n",
      "    # labels for data points\n",
      "    result = kmeans.apply()\n",
      "    return result\n",
      "\n",
      "result = apply_kmeans_iris(obsmatrix)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now let us create a 2-D plot of the clusters formed making use of the two most important features (petal length and petal width) and compare it with the earlier plot depicting the actual labels of data points."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# plot the clusters over the original points in 2 dimensions\n",
      "figure,axis = pyplot.subplots(1,1)\n",
      "for i in range(150):\n",
      "    if result[i]==0.0:\n",
      "        axis.plot(obsmatrix[2,i],obsmatrix[3,i],'ko',color='r', markersize=5)\n",
      "    elif result[i]==1.0:\n",
      "        axis.plot(obsmatrix[2,i],obsmatrix[3,i],'ko',color='g', markersize=5)\n",
      "    else:\n",
      "        axis.plot(obsmatrix[2,i],obsmatrix[3,i],'ko',color='b', markersize=5)\n",
      "\n",
      "axis.set_xlim(-1,8)\n",
      "axis.set_ylim(-1,3)\n",
      "axis.set_title('Iris plants clustered based on attributes')\n",
      "pyplot.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From the above plot, it can be inferred that the accuracy of KMeans algorithm is very high for Iris dataset. Don't believe me? Alright, then let us make use of one of Shogun's clustering evaluation techniques to formally validate the claim. But before that, we have to label each sample in the dataset with a label corresponding to the class to which it belongs. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from numpy import ones, zeros\n",
      "\n",
      "# first 50 are iris sensosa labelled 0, next 50 are iris versicolour labelled 1 and so on\n",
      "labels = concatenate((zeros(50),ones(50),2.*ones(50)),0)\n",
      "\n",
      "# bind labels assigned to Shogun multiclass labels\n",
      "ground_truth = MulticlassLabels(array(labels,dtype='float64'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we can compute clustering accuracy making use of the [ClusteringAccuracy class](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CClusteringAccuracy.html) in Shogun"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from numpy import nonzero\n",
      "\n",
      "def analyzeResult(result): \n",
      "    # shogun object for clustering accuracy\n",
      "    AccuracyEval = ClusteringAccuracy()\n",
      "\n",
      "    # changes the labels of result (keeping clusters intact) to produce a best match with ground truth\n",
      "    AccuracyEval.best_map(result, ground_truth)\n",
      "\n",
      "    # evaluates clustering accuracy\n",
      "    accuracy = AccuracyEval.evaluate(result, ground_truth)\n",
      "\n",
      "    # find out which sample points differ from actual labels (or ground truth)\n",
      "    compare = result.get_labels()-labels\n",
      "    diff = nonzero(compare)\n",
      "    return (diff,accuracy)\n",
      "\n",
      "(diff,accuracy_4d) = analyzeResult(result)\n",
      "print('Accuracy : ' + str(accuracy_4d))\n",
      "\n",
      "# plot the difference between ground truth and predicted clusters\n",
      "figure,axis = pyplot.subplots(1,1)\n",
      "axis.plot(obsmatrix[2,:],obsmatrix[3,:],'x',color='black', markersize=5)\n",
      "axis.plot(obsmatrix[2,diff],obsmatrix[3,diff],'x',color='r', markersize=7)\n",
      "axis.set_xlim(-1,8)\n",
      "axis.set_ylim(-1,3)\n",
      "axis.set_title('Difference')\n",
      "pyplot.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the above plot, wrongly clustered data points are marked in red. We see that the Iris Sentosa plants are perfectly clustered without error. The Iris Versicolour plants and Iris Virginica plants are also clustered with high accuracy, but there are some plant samples of either class that have been clustered with the wrong class. This happens near the boundary of the 2 classes in the plot and was well expected. Having mastered KMeans, it's time to move on to next interesting topic.  "
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "PCA as a preprocessor to KMeans"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "KMeans is highly affected by the <i>curse of dimensionality</i>. So, dimension reduction becomes an important preprocessing step. Shogun offers a variety of [dimension reduction techniques](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CDimensionReductionPreprocessor.html) to choose from. Since our data is not very high dimensional, PCA is a good choice for dimension reduction. We have already seen the accuracy of KMeans when all four dimensions are used. In the following exercise we shall see how the accuracy varies as one chooses lower dimensions to represent data. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "1-Dimensional representation"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us first apply PCA to reduce training features to 1 dimension"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from numpy import dot\n",
      "\n",
      "def apply_pca_to_data(target_dims):\n",
      "    train_features = RealFeatures(obsmatrix)\n",
      "    submean = PruneVarSubMean(False)\n",
      "    submean.init(train_features)\n",
      "    submean.apply_to_feature_matrix(train_features)\n",
      "    preprocessor = PCA()\n",
      "    preprocessor.set_target_dim(target_dims)\n",
      "    preprocessor.init(train_features)\n",
      "    pca_transform = preprocessor.get_transformation_matrix()\n",
      "    new_features = dot(pca_transform.T, train_features)\n",
      "    return new_features\n",
      "\n",
      "oneD_matrix = apply_pca_to_data(1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next, let us get an idea of the data in 1-D by plotting it."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "figure,axis = pyplot.subplots(1,1)\n",
      "# First 50 data belong to Iris Sentosa, plotted in green\n",
      "axis.plot(oneD_matrix[0,0:50], zeros(50), 'o', color='green', markersize=5)\n",
      "# Next 50 data belong to Iris Versicolour, plotted in red\n",
      "axis.plot(oneD_matrix[0,50:100], zeros(50), 'o', color='red', markersize=5)\n",
      "# Last 50 data belong to Iris Virginica, plotted in blue\n",
      "axis.plot(oneD_matrix[0,100:150], zeros(50), 'o', color='blue', markersize=5)\n",
      "axis.set_xlim(-5,5)\n",
      "axis.set_ylim(-1,1)\n",
      "axis.set_title('3 varieties of Iris plants')\n",
      "pyplot.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us now apply KMeans to the 1-D data to get clusters."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "result = apply_kmeans_iris(oneD_matrix)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now that we have the results, the inevitable step is to check how good these results are. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "(diff,accuracy_1d) = analyzeResult(result)\n",
      "print('Accuracy : ' + str(accuracy_1d))\n",
      "\n",
      "# plot the difference between ground truth and predicted clusters\n",
      "figure,axis = pyplot.subplots(1,1)\n",
      "axis.plot(oneD_matrix[0,:],zeros(150),'x',color='black', markersize=5)\n",
      "axis.plot(oneD_matrix[0,diff],zeros(len(diff)),'x',color='r', markersize=7)\n",
      "axis.set_xlim(-5,5)\n",
      "axis.set_ylim(-1,1)\n",
      "axis.set_title('Difference')\n",
      "pyplot.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "2-Dimensional Representation"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We follow the same steps as above and get the clustering accuracy."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "STEP 1 : Apply PCA and plot the data (plotting is optional)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "twoD_matrix = apply_pca_to_data(2)\n",
      "\n",
      "figure,axis = pyplot.subplots(1,1)\n",
      "# First 50 data belong to Iris Sentosa, plotted in green\n",
      "axis.plot(twoD_matrix[0,0:50], twoD_matrix[1,0:50], 'o', color='green', markersize=5)\n",
      "# Next 50 data belong to Iris Versicolour, plotted in red\n",
      "axis.plot(twoD_matrix[0,50:100], twoD_matrix[1,50:100], 'o', color='red', markersize=5)\n",
      "# Last 50 data belong to Iris Virginica, plotted in blue\n",
      "axis.plot(twoD_matrix[0,100:150], twoD_matrix[1,100:150], 'o', color='blue', markersize=5)\n",
      "axis.set_title('3 varieties of Iris plants')\n",
      "pyplot.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "STEP 2 : Apply KMeans to obtain clusters"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "result = apply_kmeans_iris(twoD_matrix)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "STEP 3: Get the accuracy of the results"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "(diff,accuracy_2d) = analyzeResult(result)\n",
      "print('Accuracy : ' + str(accuracy_2d))\n",
      "\n",
      "# plot the difference between ground truth and predicted clusters\n",
      "figure,axis = pyplot.subplots(1,1)\n",
      "axis.plot(twoD_matrix[0,:],twoD_matrix[1,:],'x',color='black', markersize=5)\n",
      "axis.plot(twoD_matrix[0,diff],twoD_matrix[1,diff],'x',color='r', markersize=7)\n",
      "axis.set_title('Difference')\n",
      "pyplot.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "3-Dimensional Representation"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Again, we follow the same steps, but skip plotting data."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "STEP 1: Apply PCA to data"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "threeD_matrix = apply_pca_to_data(3)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "STEP 2: Apply KMeans to 3-D representation of data"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "result = apply_kmeans_iris(threeD_matrix)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "STEP 3: Get accuracy of results. In this step, the 'difference' plot positions data points based petal length \n",
      "    and petal width in the original data. This will enable us to visually compare these results with that of KMeans applied\n",
      "    to 4-Dimensional data (ie. our first result on Iris dataset)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "(diff,accuracy_3d) = analyzeResult(result)\n",
      "print('Accuracy : ' + str(accuracy_3d))\n",
      "\n",
      "# plot the difference between ground truth and predicted clusters\n",
      "figure,axis = pyplot.subplots(1,1)\n",
      "axis.plot(obsmatrix[2,:],obsmatrix[3,:],'x',color='black', markersize=5)\n",
      "axis.plot(obsmatrix[2,diff],obsmatrix[3,diff],'x',color='r', markersize=7)\n",
      "axis.set_title('Difference')\n",
      "axis.set_xlim(-1,8)\n",
      "axis.set_ylim(-1,3)\n",
      "pyplot.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally, let us plot clustering accuracy vs. number of dimensions to consolidate our results."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from scipy.interpolate import interp1d\n",
      "from numpy import linspace\n",
      "\n",
      "x = array([1, 2, 3, 4])\n",
      "y = array([accuracy_1d, accuracy_2d, accuracy_3d, accuracy_4d])\n",
      "f = interp1d(x, y)\n",
      "xnew = linspace(1,4,10)\n",
      "pyplot.plot(x,y,'o',xnew,f(xnew),'-')\n",
      "pyplot.xlim([0,5])\n",
      "pyplot.xlabel('no. of dims')\n",
      "pyplot.ylabel('Clustering Accuracy')\n",
      "pyplot.title('PCA Results')\n",
      "pyplot.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The above plot is not very intuitive theoretically. The accuracy obtained by using just one latent dimension is much more than that obtained by taking all four features features. A plausible explanation could be that the mixing of data points from Iris Versicolour and Iris Virginica is least along the single principal dimension chosen by PCA. Additional dimensions only aggrevate this inter-mixing, thus resulting in poorer clustering accuracy. While there could be other explanations to the observed results, our small experiment has successfully highlighted the importance of PCA. Not only does it reduce the complexity of running KMeans, it also enhances results at times."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "References"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[1]  D. Sculley. Web-scale k-means clustering. In Proceedings of the 19th international conference on World wide web, pages 1177\u20131178. ACM, 2010\n",
      "\n",
      "[2] Bishop, C. M., & others. (2006). Pattern recognition and machine learning. Springer New York.\n",
      "\n",
      "[3] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}
